st 1
Constrained Feedback Learning for Non-Stationary Multi-Armed Bandits
Non-stationary multi-armed bandits (NSMAB) enable agents to adapt to changing environments by incorporating mechanisms to detect and respond to shifts in reward distributions, making them well-suited for dynamic settings. However, existing approaches typically assume that reward feedback is available at every round--an assumption that overlooks many real-world scenarios where feedback is limited. In this paper, we take a significant step forward by introducing a new model of constrained feedback in non-stationary multi-armed bandits (CONFEE-NSMAB), where the availability of reward feedback is restricted. We propose the first priorfree algorithm--that is, one that does not require prior knowledge of the degree of non-stationarity--that achieves near-optimal dynamic regret in this setting. Specifically, our algorithm attains a dynamic regret of O(K1/3V1/3TT/B1/3), where T is the number of rounds, K is the number of arms, B is the query budget, and VT is the variation budget capturing the degree of non-stationarity.
Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data
Estimating heterogeneous treatment effects (HTEs) is crucial for personalized decision-making. However, this task is challenging in survival analysis, which includes time-to-event data with censored outcomes (e.g., due to study dropout). In this paper, we propose a toolbox of orthogonal survival learners to estimate HTEs from time-to-event data under censoring. Our learners have three main advantages: (i) we show that learners from our toolbox are guaranteed to be orthogonal and thus robust with respect to nuisance estimation errors; (ii) our toolbox allows for incorporating a custom weighting function, which can lead to robustness against different types of low overlap, and (iii) our learners are modelagnostic (i.e., they can be combined with arbitrary machine learning models). We instantiate the learners from our toolbox using several weighting functions and, as a result, propose various neural orthogonal survival learners. Some of these coincide with existing survival learners (including survival versions of the DRand R-learner), while others are novel and further robust w.r.t.
Doubly Outlier-Robust Online Infinite Hidden Markov Model
Yiu, Horace, Sánchez-Betancourt, Leandro, Cartea, Álvaro, Duran-Martin, Gerardo
We derive a robust update rule for the online infinite hidden Markov model (iHMM) for when the streaming data contains outliers and the model is misspecified. Leveraging recent advances in generalised Bayesian inference, we define robustness via the posterior influence function (PIF), and provide conditions under which the online iHMM has bounded PIF. Imposing robustness inevitably induces an adaptation lag for regime switching. Our method, which is called Batched Robust iHMM (BR-iHMM), balances adaptivity and robustness with two additional tunable parameters. Across limit order book data, hourly electricity demand, and a synthetic high-dimensional linear system, BR-iHMM reduces one-step-ahead forecasting error by up to 67% relative to competing online Bayesian methods. Together with theoretical guarantees of bounded PIF, our results highlight the practicality of our approach for both forecasting and interpretable online learning.
Breaking the $O(\sqrt{T})$ Cumulative Constraint Violation Barrier while Achieving $O(\sqrt{T})$ Static Regret in Constrained Online Convex Optimization
Balasundaram, Haricharan, Mahendran, Karthick Krishna, Vaze, Rahul
The problem of constrained online convex optimization is considered, where at each round, once a learner commits to an action $x_t \in \mathcal{X} \subset \mathbb{R}^d$, a convex loss function $f_t$ and a convex constraint function $g_t$ that drives the constraint $g_t(x)\le 0$ are revealed. The objective is to simultaneously minimize the static regret and cumulative constraint violation (CCV) compared to the benchmark that knows the loss functions and constraint functions $f_t$ and $g_t$ for all $t$ ahead of time, and chooses a static optimal action that is feasible with respect to all $g_t(x)\le 0$. In recent prior work Sinha and Vaze [2024], algorithms with simultaneous regret of $O(\sqrt{T})$ and CCV of $O(\sqrt{T})$ or (CCV of $O(1)$ in specific cases Vaze and Sinha [2025], e.g. when $d=1$) have been proposed. It is widely believed that CCV is $Ω(\sqrt{T})$ for all algorithms that ensure that regret is $O(\sqrt{T})$ with the worst case input for any $d\ge 2$. In this paper, we refute this and show that the algorithm of Vaze and Sinha [2025] simultaneously achieves regret of $O(\sqrt{T})$ regret and CCV of $O(T^{1/3})$ when $d=2$.
e7663e974c4ee7a2b475a4775201ce1f-Supplemental-Conference.pdf
The key challenge in making this connection is grounding the skills, so that each skill corresponds to a specific goal-conditioned policy. We start by recalling the definition of the discounted state occupancymeasure(Eq.3): p(st+=sg)=(1 γ) X On the second line, we havechanged the bounds of the summation to start at 0, and changed the terms inside the summation accordingly. On the third line, we applied linearity of expectation to movethesummation insidetheexpectation. Onthefourthline,weappliedlinearity ofexpectation again to move the term fort = 0 inside the expectation. Finally, we substituted the definition of rg(s,a)toobtainthedesiredresult. This result means that we are doing policy improvement with approximate Q-values.